Paraphrase Identification by Text Canonicalization
Authors
Abstract
This paper proposes an approach to sentence-level paraphrase identification by text canonicalization. The source sentence pairs are first converted into surface text that approximates canonical forms. A decision tree learning module that employs simple lexical matching features then takes the canonicalized texts as input for supervised learning. Experiments on the Microsoft Research (MSR) Paraphrase Corpus yield a Confidence-weighted Score of 0.791, comparable to the performance of other systems equipped with more sophisticated lexical semantic and syntactic matching components. An ancillary experiment using the occurrence of nominalizations suggests that the MSR Paraphrase Corpus might not be a rich source for learning paraphrasing patterns.
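The two stages described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the rewrite rules (lowercasing, a tiny contraction table, number normalization), the function names `canonicalize` and `lexical_features`, and the particular features are all assumptions chosen for illustration. The resulting feature dictionaries are the kind of input one might pass to a decision tree learner; the learner itself is omitted here.

```python
import re

# Hypothetical rewrite rules; the paper's actual canonicalization rules
# are not given in the abstract.
CONTRACTIONS = {"don't": "do not", "isn't": "is not", "it's": "it is"}


def canonicalize(sentence):
    """Map a sentence to a rough canonical token list (illustrative only)."""
    text = sentence.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Collapse every number (including forms like 1,000 or 3.5) to one placeholder.
    text = re.sub(r"\d+(?:[.,]\d+)*", "<num>", text)
    # Keep the placeholder and alphabetic words; drop punctuation.
    return re.findall(r"<num>|[a-z]+", text)


def lexical_features(tokens_a, tokens_b):
    """Simple lexical matching features over two canonicalized token lists."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    shared = set_a & set_b
    union = set_a | set_b
    return {
        "jaccard": len(shared) / len(union) if union else 0.0,
        "overlap_a": len(shared) / len(set_a) if set_a else 0.0,
        "overlap_b": len(shared) / len(set_b) if set_b else 0.0,
        "length_diff": abs(len(tokens_a) - len(tokens_b)),
    }


if __name__ == "__main__":
    a = canonicalize("It isn't 10 dollars.")
    b = canonicalize("It is not 10.00 dollars")
    print(a)
    print(lexical_features(a, b))
```

In this sketch, the two example sentences differ on the surface but share the same canonical token list, so the lexical features report a full match. That is the intuition behind canonicalizing before matching: simple features become more informative once surface variation is reduced.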
Similar resources
Learning to Recognize Ancillary Information for Automatic Paraphrase Identification
Previous work on Automatic Paraphrase Identification (PI) is mainly based on modeling text similarity between two sentences. In contrast, we study methods for automatically detecting whether a text fragment only appearing in a sentence of the evaluated sentence pair is important or ancillary information with respect to the paraphrase identification task. Engineering features for this new task i...
Idiom Paraphrases: Seventh Heaven vs Cloud Nine
The goal of paraphrase identification is to decide whether two given text fragments have the same meaning. Of particular interest in this area is the identification of paraphrases among short texts, such as SMS and Twitter. In this paper, we present idiomatic expressions as a new domain for short-text paraphrase identification. We propose a technique, utilizing idiom definitions and continuous ...
Re-examining Machine Translation Metrics for Paraphrase Identification
We propose to re-examine the hypothesis that automated metrics developed for MT evaluation can prove useful for paraphrase identification in light of the significant work on the development of new MT metrics over the last 4 years. We show that a meta-classifier trained using nothing but recent MT metrics outperforms all previous paraphrase identification approaches on the Microsoft Research Par...
Constructing a Canonicalized Corpus of Historical German by Text Alignment ---draft
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a static lexicon indexed by orthographic form. Canonicalization approaches seek to address these issues by assigning an extant equivalent to each word...
More than Words: Using Token Context to Improve Canonicalization of Historical German
Historical text presents numerous challenges for contemporary natural language processing techniques. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for any system requiring reference to a fixed lexicon accessed by orthographic form, such as information retrieval systems (Sokirko, 2003; Cafarella and Cutting, 2004), part-of-speech tagg...